Join Size Estimation Subject to Filter Conditions

نویسندگان

  • David Vengerov
  • Andre Cavalheiro Menck
  • Mohamed Zaït
  • Sunil Chakkappen
چکیده

In this paper, we present a new algorithm for estimating the size of equality join of multiple database tables. The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can then be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions, possibly specified over multiple columns (attributes) of each table. This algorithm makes a single pass over the data and is thus suitable for streaming scenarios. We compare this algorithm analytically to two other previously known sampling approaches (independent Bernoulli Sampling and End-Biased Sampling) and to a novel sketch-based approach. We also compare these four algorithms experimentally and show that results fully correspond to our analytical predictions based on derived expressions for the estimator variances, with Correlated Sampling giving the best estimates in a large range of situations.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Heap-Filter Merge Join: A new algorithm for joining medium-size relations

We present a new algorithm for relational equi-join. The algorithm is a modification of merge join but promises superior performance for medium-size relations. In many cases, it even compares favorably with hybrid hash join.

متن کامل

Similarity Join Size Estimation using Locality Sensitive Hashing

Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the ...

متن کامل

Power-Law Based Estimation of Set Similarity Join Size

We propose a novel technique for estimating the size of set similarity join. The proposed technique relies on a succinct representation of sets using Min-Hash signatures. We exploit frequent patterns in the signatures for the Set Similarity Join (SSJoin) size estimation by counting their support. However, there are overlaps among the counts of signature patterns and we need to use the set Inclu...

متن کامل

Join Size Estimation on Boolean Tensors of RDF Data

The Resource Description Framework (rdf) represents information as subject–predicate–object triples. These triples are commonly interpreted as a directed labelled graph. We instead interpret the data as a 3-way Boolean tensor. Standard sparql queries then can be expressed using elementary Boolean algebra operations. We show how this representation helps to estimate the size of joins. Such estim...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • PVLDB

دوره 8  شماره 

صفحات  -

تاریخ انتشار 2015